Models that perform only slightly better than random guessing are called weak learners (Bishop 2007).
A weak learner's low predictive accuracy may be due to the predictor having high bias or high variance.
In many situations trees may be a good option (e.g., for their simplicity and interpretability),
but there are issues that, if solved, may improve performance.
Note, too, that these problems are not unique to trees.
How can a model be made less variable without increasing its bias, or less biased without increasing its variance?
There are distinct approaches to deal with this problem.
Ensemble learning takes a distinct approach, based on “the wisdom of crowds”.
Predictors, also called ensembles, are built by fitting multiple (weak learner) models on the same data and combining them to form a single result.
As a general rule, ensemble learners tend to improve the results obtained with the weak learners they are made of.
Depending on how they deal with the bias-variance trade-off, we can consider distinct groups of ensemble methods:
Bagging
Boosting
Hybrid learners
Boosting and Stacking combine distinct predictors to yield a model with an increasingly smaller error, and so reduce the bias.
They differ in how they perform the combination:
sequentially (Boosting) or
using a meta-model (Stacking).
Hybrid techniques combine approaches in order to deal with both variance and bias.
The approach should be clear from their names:
Gradient Boosted Trees with Bagging
Stacked bagging
Decision trees suffer from high variance when compared with other methods such as linear regression, especially when \(n/p\) is moderately large.
Given that this is intrinsic to trees, Breiman (1996) suggested building multiple trees derived from the same dataset and, somehow, averaging them.
Bagging relies, informally, on the idea that if \(X_1,\dots,X_n\) are i.i.d. with variance \(\sigma^2\), then \(Var(\overline{X})=\sigma^2/n\).
That is, relying on the sample mean instead of on single observations decreases the variance by a factor of \(n\).
Incidentally, this idea remains (approximately) valid for more general statistics to which the CLT applies.
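The variance-reduction idea can be checked with a small simulation (a hypothetical example, not from the text): the empirical variance of sample means of size \(n\) should be close to \(\sigma^2/n\).

```python
import random
import statistics

# Simulate many samples of size n from a standard normal and verify that
# the variance of the sample mean is approximately Var(X)/n = 1/n.
rng = random.Random(42)
n, reps = 25, 20000
means = [statistics.mean(rng.gauss(0, 1) for _ in range(n)) for _ in range(reps)]
var_of_mean = statistics.variance(means)
print(round(var_of_mean, 3), round(1 / n, 3))  # both should be close to 0.04
```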
Two questions arise here:
The bootstrap has been applied to almost any problem in Statistics.
We illustrate it with the simplest case: estimating the standard error (that is the square root of the variance) of an estimator.
Assume we want to estimate some parameter \(\theta\), that can be expressed as \(\theta (F)\), where \(F\) is the distribution function of each \(X_i\) in \((X_1,X_2,...,X_n)\).
For example:
\[\begin{eqnarray*} \theta &=& E_F(X)=\theta (F) \\ \theta &=& Med(X)=\{m:P_F(X\leq m)=1/2\}= \theta (F). \end{eqnarray*}\]
The corresponding plug-in estimators are obtained by replacing \(F\) with the empirical distribution \(F_n\):
\[\begin{eqnarray*} \hat{\theta}&=&\overline{X}=\int XdF_n(x)=\frac 1n\sum_{i=1}^nx_i=\theta (F_n) \\ \hat{\theta}&=&\widehat{Med}(X)=\{m:\frac{\#x_i\leq m}n=1/2\}=\theta (F_n) \end{eqnarray*}\]
A key point when computing an estimator \(\hat \theta\) of a parameter \(\theta\) is: how precise is \(\hat \theta\) as an estimator of \(\theta\)?
With the sample mean, \(\overline{X}\), the standard error estimation is immediate because the variance of the estimator is known: \(\sigma_\overline{X}=\frac{\sigma (X)}{\sqrt{n}}\)
So, a natural estimator of the standard error of \(\overline{X}\) is: \(\hat\sigma_\overline{X}=\frac{\hat{\sigma}(X)}{\sqrt{n}}\)
\[ \sigma _{\overline{X}}=\frac{\sigma (X)}{\sqrt{n}}=\frac{\sqrt{\int [x-\int x\,dF(x)]^2 dF(x)}}{\sqrt{n}}=\sigma _{\overline{X}}(F) \]
then the standard error estimator is the same functional applied to \(F_n\), that is:
\[ \hat{\sigma}_{\overline{X}}=\frac{\hat{\sigma}(X)}{\sqrt{n}}=\frac{\sqrt{1/n\sum_{i=1}^n(x_i-\overline{x})^2}}{\sqrt{n}}=\sigma _{\overline{X}}(F_n). \]
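As a quick numeric illustration of the plug-in formula (hypothetical data, not from the text), the standard error of the sample mean is computed directly from \(F_n\):

```python
import math

# Plug-in estimate of the standard error of the sample mean:
# sigma_hat(X) / sqrt(n), using the (1/n)-style variance from the formula above.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
mean = sum(x) / n
var_plugin = sum((xi - mean) ** 2 for xi in x) / n   # (1/n) * sum (x_i - xbar)^2
se_mean = math.sqrt(var_plugin) / math.sqrt(n)
print(round(se_mean, 3))  # prints 0.707
```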
The bootstrap method makes it possible to do the desired approximation: \[\hat{\sigma}_{\hat\theta} \simeq \sigma _{\hat\theta}(F_n)\] without having to know the form of \(\sigma_{\hat\theta}(F)\).
To do this, the bootstrap estimates, or directly approximates, \(\sigma_{\hat{\theta}}(F_n)\) from the sample.
That is, the bootstrap estimates the standard error by drawing samples from \(F_n\):
the substitution of \(F\) by \(F_n\) is carried out in the sampling step.
\[\begin{eqnarray*} &&\mbox{Instead of: } \\ && \quad F\stackrel{s.r.s}{\longrightarrow }{\bf X} = (X_1,X_2,\dots, X_n) \, \quad (\hat \sigma_{\hat\theta} =\underbrace{\sigma_\theta(F_n)}_{unknown}) \\ && \mbox{It is done: } \\ && \quad F_n\stackrel{s.r.s}{\longrightarrow }\quad {\bf X^{*}}=(X_1^{*},X_2^{*}, \dots ,X_n^{*}) \quad (\hat \sigma_{\hat\theta}= \hat \sigma_{\hat \theta}^* \simeq \sigma_{\hat \theta}^*). \end{eqnarray*}\]
Here, \(\sigma_{\hat \theta}^*\) is the bootstrap standard error of \(\hat \theta\) and
\(\hat \sigma_{\hat \theta}^*\) the bootstrap estimate of the standard error of \(\hat \theta\).
That is, the new (re-)sampling process consists of extracting samples of size \(n\) of \(F_n\):
\({\bf X^{*}}=(X_1^{*},X_2^{*},\dots ,X_n^{*})\) is a random sample of size \(n\) obtained with replacement from the original sample \((X_1,X_2,\dots ,X_n)\).
Samples \({\bf X^*}\) obtained through this procedure are called bootstrap samples or re-samples.
\[\begin{eqnarray*} \mathcal {L}(\hat \theta)&\simeq& P_F(\hat\theta \leq t): \mbox{Sampling distribution of } \hat \theta,\\ \mathcal {L}(\hat \theta^*)&\simeq& P_{F_n}(\hat\theta^* \leq t): \mbox{Bootstrap distribution of } \hat \theta, \end{eqnarray*}\]
This distribution is usually not known.
However, the sampling process and the calculation of the statistic can be approximated using a Monte Carlo algorithm.
\[ \mbox{if }B\rightarrow\infty \mbox{ then } \hat{\sigma}_B (\hat\theta) \rightarrow \hat\sigma_{\infty} (\hat\theta) =\sigma_B(\hat\theta)=\sigma_{\hat\theta}(F_n). \]
The bootstrap approximation, \(\hat{\sigma}_B(\hat\theta)\), to the bootstrap SE, \(\sigma_B(\hat\theta)\), provides an estimate of \(\sigma_{\hat\theta}(F_n)\):
\[ \hat{\sigma}_B(\hat\theta)(\simeq \sigma_B(\hat\theta)=\sigma_{\hat\theta}(F_n))\simeq\hat \sigma_{\hat\theta}(F_n). \]
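The Monte Carlo approximation \(\hat\sigma_B(\hat\theta)\) can be sketched as follows, here for the sample median (a hypothetical example; the data, \(B\), and function names are illustrative):

```python
import random
import statistics

# Monte Carlo bootstrap: draw B resamples of size n with replacement from
# the data (i.e., sample from F_n), compute the statistic on each resample,
# and take the standard deviation of the B replicates as the SE estimate.
def bootstrap_se(data, stat, B=2000, seed=123):
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(B):
        resample = [data[rng.randrange(n)] for _ in range(n)]  # sample from F_n
        replicates.append(stat(resample))
    return statistics.stdev(replicates)

data = [2.1, 3.4, 1.9, 5.2, 4.4, 2.8, 3.9, 4.1, 3.3, 2.5]
se_median = bootstrap_se(data, statistics.median)
print(round(se_median, 3))
```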
From real world to bootstrap world:
\[\hat f_{bag}(x)=\frac 1B \sum_{b=1}^B \hat f^{*b}(x) \]
\[ \hat G_{bag}(x) = \arg \max_k \hat f^k_{bag}(x). \]
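The averaging in \(\hat f_{bag}\) can be sketched as follows (a hypothetical pure-Python toy, with a one-split regression stump as the base learner; all names and data are illustrative):

```python
import random
import statistics

# Bagging for regression: fit B weak learners on bootstrap resamples and
# average their predictions, f_bag(x) = (1/B) * sum_b f^{*b}(x).
def fit_stump(xs, ys):
    """Fit a one-split regression stump: pick the threshold minimising
    squared error, predicting the mean response on each side."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = statistics.mean(left), statistics.mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def bagged_fit(xs, ys, B=25, seed=123):
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap resample
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(f(x) for f in stumps) / B        # average the B predictions

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 0.9, 1.0, 1.2, 3.0, 3.1, 2.9, 3.2]  # a step around x = 4.5
f_bag = bagged_fit(xs, ys)
print(round(f_bag(2), 2), round(f_bag(7), 2))
```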
Since each out-of-bag set is not used to train the model, it can be used to evaluate performance.
Source: https://www.baeldung.com/cs/random-forests-out-of-bag-error
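Each bootstrap sample leaves out, on average, a fraction \((1-1/n)^n \approx e^{-1} \approx 36.8\%\) of the observations, which is what makes the out-of-bag evaluation possible; a quick numeric check (illustrative):

```python
import math

# Fraction of observations left out ("out-of-bag") of a bootstrap sample
# of size n drawn with replacement: (1 - 1/n)^n -> 1/e as n grows.
n = 1000
oob_fraction = (1 - 1 / n) ** n
print(round(oob_fraction, 3), round(1 / math.e, 3))  # prints 0.368 0.368
```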
AmesHousing dataset on house prices in Ames, IA.
We use the libraries:
rpart for stratified resampling
ipred for bagging
caret
A complementary way to interpret a tree is by quantifying how important each feature is.
This is done by measuring the total reduction in the loss function associated with each variable across all splits.
This measure can be extended to an ensemble simply by adding up the variable importance over all the trees built.
The vip function from the vip package can be used (see lab examples).
Random Forests Algorithm, from chapter 17 in (Hastie and Efron 2016)
library(ranger)

# number of features
n_features <- length(setdiff(names(ames_train), "Sale_Price"))
# train a default random forest model
ames_rf1 <- ranger(
Sale_Price ~ .,
data = ames_train,
mtry = floor(n_features / 3),
respect.unordered.factors = "order",
seed = 123
)
# get OOB RMSE
(default_rmse <- sqrt(ames_rf1$prediction.error))
## [1] 24859.27

There are several hyperparameters that, appropriately tuned, can improve RF performance.
Parameters 1 and 2 tend to have the largest impact on predictive accuracy.
The idea of improving weak learners by aggregation has moved historically along two distinct lines:
Idea: create a model that is better than any of its individual components by combining their strengths and compensating for their weaknesses.
For this, multiple weak models are trained sequentially, and each new model is trained to improve the errors made by the previous model.
The final model is a weighted combination of all the models, and the weights are determined by the accuracy of each model.
Boosting, like other ensemble methods, improves the accuracy of weak learners and achieves better predictive performance than the individual models.
Boosting also reduces overfitting by improving the generalization ability of models.
Available in many flavors.
Can be parallelized.
A strong track record in real-world applications and industry.
Can be computationally expensive, especially when dealing with large datasets and complex models.
Can be sensitive to noisy data and outliers.
May not work well with certain types of data distributions.
Not as good “out-of-the-box”: requires careful tuning of hyperparameters to achieve optimal performance, which can be time-consuming and challenging.
The main steps in Gradient Boosting are:
1. Initialize the model with a constant prediction (e.g., the mean of the response).
2. Compute the negative gradient of the loss function for the current model (the pseudo-residuals).
3. Fit a weak learner to the pseudo-residuals and add it, scaled by a learning rate, to the current model.
4. Repeat steps 2-3 for a specified number of iterations or until convergence.
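A minimal sketch of these steps for squared-error loss, where the pseudo-residuals are simply \(y - F(x)\) (a hypothetical pure-Python example with a one-split stump as the weak learner; names and data are illustrative):

```python
import statistics

# Gradient boosting for squared-error loss: each round fits a weak learner
# (here a one-split stump) to the residuals and adds it, scaled by lr.
def fit_stump(xs, rs):
    best = None
    for t in sorted(set(xs))[:-1]:          # exclude max so both sides are non-empty
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        ml, mr = statistics.mean(left), statistics.mean(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def gradient_boost(xs, ys, n_rounds=50, lr=0.1):
    f0 = statistics.mean(ys)                          # step 1: constant initial model
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # step 2: pseudo-residuals
        stump = fit_stump(xs, residuals)                # step 3: fit weak learner
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
f = gradient_boost(xs, ys)
print(round(f(2), 2), round(f(5), 2))
```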
Multiple extensions of Gradient Boosting exist:
XGBoost
LightGBM
Fraud Detection
Image and Speech Recognition
Anomaly Detection
Medical Diagnosis
Amazon’s recommendation engine
Models that predict protein structures from amino acid sequences
Pattern identification in fMRI brain scans.